Application of machine learning algorithms to the study of noise artifacts in gravitational- wave data 
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The sensitivity of searches for astrophysical transients in data from the Laser Interferometer Gravitational- 
wave Observatory (LIGO) is generally limited by the presence of transient, non-Gaussian noise artifacts, which 
occur at a high-enough rate such that accidental coincidence across multiple detectors is non-negligible. Further- 
more, non-Gaussian noise artifacts typically dominate over the background contributed from stationary noise. 
These "glitches" can easily be confused for transient gravitational-wave signals, and their robust identification 
and removal will help any search for astrophysical gravitational-waves. We apply Machine Learning Algorithms 
(MLAs) to the problem, using data from auxiliary channels within the LIGO detectors that monitor degrees of 
freedom unaffected by astrophysical signals. Terrestrial noise sources may manifest characteristic disturbances 
in these auxiliary channels, inducing non-trivial correlations with glitches in the gravitational-wave data. The 
number of auxiliary-channel parameters describing these disturbances may also be extremely large; an area 
where MLAs are particularly well-suited. We demonstrate the feasibility and applicability of three very dif- 
ferent MLAs: Artificial Neural Networks, Support Vector Machines, and Random Forests. These classifiers 
identify and remove a substantial fraction of the glitches present in two very different data sets: four weeks 
of LIGO's fourth science run and one week of LIGO's sixth science run. We observe that all three algorithms 
agree on which events are glitches to within 10% for the sixth science run data, and support this by showing 
that the different optimization criteria used by each classifier generate the same decision surface, based on a 
likelihood-ratio statistic. Furthermore, we find that all classifiers obtain similar limiting performance, suggest- 
ing that most of the useful information currently contained in the auxiliary channel parameters we extract is 
already being used. Future performance gains are thus likely to involve additional sources of information, rather 
than improvements in the MLAs themselves. 



I. INTRODUCTION 

The Laser Interferometer Gravitational-wave Observa- 
tory (LIGO) is a two-site network of ground-based detec- 
tors designed for the direct detection and measurement of 
gravitational-wave signals from astrophysical sources lUE). 
The LIGO detectors, in their initial configuration [l], have op- 
erated since 2001 and conducted several scientific runs, col- 
lecting data with incrementally increased sensitivity in each 
run PI 131, 4|. Although no gravitational waves were detected, 
these runs tested and refined key technologies, as well as pro- 
vided a large amount of data characteiizing the detectors. The 
next generation of detectors, refeiTed to as the advanced LIGO 
detectors, are currently under construction and are expected to 
be operational by 2015 [1 , 2|. Major upgrades to lasers, op- 
tics, and seismic isolation/sensing will provide roughly a fac- 
tor of ten improvement to sensitivity, which corresponds to a 
factor of 1000 in the observable volume of space and the num- 
ber of detectable sources. Based on our cuiTent knowledge of 
potential astrophysical sources, the advanced LIGO detectors 
are expected to make routine gravitational-wave detections 
(see for example fSj) and will open the era of gravitational- 
wave astronomy. 



The LIGO detector noise may be characterized by an ap- 
proximately stationary component of colored Gaussian noise, 
with the addition of short duration non-Gaussian noise ar- 
tifacts called "glitches" (other noise sources such as non- 
stationary lines and broadband non-stationarity do not always 
fit neatly into this framework). The stationary noise in the 
instrument is dominated by low-frequency seismic noise cou- 
pling to miiTor motion, thermal noise in the miiTors and sus- 
pensions, 60 Hz power lines and harmonics, and shot noise. 
Sources of transient noise can include temporary seismic, 
acoustic, or magnetic disturbances, power transients, scattered 
light, dust crossing the beam, instabilities in the interferome- 
ter, channel saturations, and other complicated and often non- 
linear effects. To monitor these disturbances and keep the in- 
strument in a stable operating condition through active feed- 
back, each detector records hundreds of auxiliary channels 
along with the gravitational-wave channel. These auxiliary 
channels keep track of important non-gravitational-wave de- 
grees of freedom in the interferometer, as well as information 
about the local environment. They are critical to understand- 
ing the state of the instrument at any particular time. 

One of LIGO's main scientific goals is the detection of 
transient gravitational- wave signals, which can come from 
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the coalescence of a compact binary or core-collapse super- 
nova, among other astrophysical sources |6|. The presence 
of glitches is problematic for searches targeting these sig- 
nals because glitches can be easily confused with transient 
gravitational-wave signals. The primary method to distin- 
guish a real gravitational-wave transient from instrumental ar- 
tifact is to check that a signal appears in two or more geo- 
graphically distant detectors. While this coincidence require- 
ment is extremely effective, a high rate of glitches means 
that accidental coincidence of noise transients across multi- 
ple detectors still dominates the search background, result- 
ing in weaker upper limits and making the confident detec- 
tion of real signals challenging. The problem is most se- 
vere in searches for transients with poorly-modeled or little 
identifying waveform structure, such as generic gravitational- 
wave bursts or intermediate-mass binary black-hole coales- 
cence which spend only a short amount of time (few cycles) 
in the LIGO sensitive band during merger 

While the precise noise characteristics in the advanced de- 
tectors will be different from those of initial LIGO, glitch 
sources for future data will exist and the detection problem 
for short duration signals will persist. Thus, it is critical to 
develop data analysis methods for the robust identification of 
glitches in LIGO data. For previous investigations of instru- 
mental artifacts, their impact on gravitational-wave searches 
and the methods for their identification, see lf7l 4T4ll . We use 
the Ordered Veto List (OVL) algorithm as a benchmark for 
our investigations. OVL has been used in recent LIGO sci- 
ence runs as one of the primary glitch detection algorithms. 
In particular, an earlier version of OVL described in [151 was 
used during LIGO's fifth science run lfT6l . OVL attempts to 
measure the degree of likelihood that a gravitational wave can- 
didate can be associated with a transient instrumental distur- 
bance found in one of the many auxiliary channels. 

Glitches are induced by the detector's environment, noise in 
the detector subsystems, or a combination of thereof. These 
sources should appear in the auxiliary channels as well. Be- 
cause legitimate gravitational waves may couple to channels 
besides the nominal gravitational wave channel, we use a sub- 
set of auxiliary channels shown to be insensitive to gravita- 
tional waves. This subset is generated through hardware in- 
jections at the detectors 1 13 1. The hardware injections involve 
driving the test masses through magnetic coupling with an ex- 
pected gravitational wave signal and searching for evidence 
of that signal in auxiliary channels. If the signal does not 
systematically appear in an auxiliary channel, that channel is 
deemed "safe" and we include it in our analysis. By analyz- 
ing information from these auxiliary channels, one may be 
able to distinguish glitches from genuine gravitational-wave 
signals and ideally establish their cause. The main difficulty 
in such an analysis is processing the information from hun- 
dreds of channels which may manifest non-trivial correlations 
between themselves when they respond to an instrumental 
disturbance. Given the high dimensionality and the absence 
of reliable statistical models for noise and coupling between 
auxiliary channels, traditional computational methods are not 
well-suited to this problem. On the other hand. Machine learn- 
ing algorithms (MLA)s have been used to solve problems like 



this since the 1970's in other fields such as computer science, 
biology, and finance. 

This paper presents the use of MLAs for the purpose of 
glitch identification in gravitational-wave detectors. The main 
goal of the paper is to establish the feasibility of apply- 
ing MLAs in the context of the LIGO detectors. We con- 
sider three well-known algorithms: the Artificial Neural Net- 
work (ANN), the Support Vector Machines (SVM), and the 
Random Forest (RF). We explore their properties and test 
their performance by analyzing data from past scientific runs. 
Based on these tests, we discuss the prospects for using MLAs 
for glitch identification in the advanced LIGO detectors. 

This paper is organized as follows. In Section [11] we 
describe the process for reducing raw time-series data and 
preparing feature vectors for the MLA classifiers. This is fol- 
lowed by a general formulation of the glitch detection prob- 
lem in Sectionjlll] Then, in Section|lV] we briefly describe the 
classifiers' algorithms. Training and testing of the classifiers 
is discussed in Section [V] Finally, we evaluate and compare 
the classifiers' performances using the standard Receiver Op- 
erating Characteristic (ROC) curves in Section IVI] and inves- 



tigate various ways of combining classifiers in Section VII In 
Appendix [a] we explore several optimization criteria used by 
the classifiers and verify their theoretical consistency. 



II. DATA PREPARATION 

We use data taken by the 4km-arm detector at Hanford, WA 
(HI) during LIGO's fourth science run (S4: 24 February - 
24 March 2005), and data taken by the 4km-arm detector at 
Livingston, LA (LI) during one week (28 May - 4 June 2010) 
of LIGO's sixth science run (S6: 7 July 2009 - 20 October 
2010). Hereafter we refer to these data sets as the S4 and the 
S6 data. 

In the time between the fourth and the sixth science runs, 
the detectors underwent major commissioning and improve- 
ments to their sensitivity. Thus, while the HI and LI detec- 
tors are identical by design, the data taken by HI during S4 
and the data taken by LI during S6 are quite different. These 
data sets represent evolutionary changes in both the detec- 
tor noise power-spectral-density and the non-Gaussian tran- 
sient artifacts. Differences in the detectors' environments due 
to their distant geographical locations add another degree of 
freedom. Processing data from detectors separated in time 
and location allows us to determine how adaptable and robust 
these analysis algorithms are. This is important when extrap- 
olating their performance to advanced detectors. 

Classification, or the separation of input data into various 
categories, is one of MLAs main uses; thus, they are often 
referred to as classifiers. We have two categories of data: 
glitches (Class 1) and "clean" data (Class 0). If one was to 
perform a search for gravitational-wave transient signals, the 
first category, glitches, would generally be identified as candi- 
date transient events and considered false alarms. The second 
category, "clean" data, contain only Gaussian detector noise. 
A true gravitational-wave signal, when it arrives at the detec- 
tor, is superposed on the Gaussian detector noise. If signal's 
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amplitude is high enough, it also would be identified by the 
search algorithm as a candidate transient event. Since it is 
a genuine gravitational-wave transient, it would constitute an 
actual detection, as opposed to glitches which act as noise. 
Hereafter we refer to such candidate gravitational-wave tran- 
sient events, either genuine gravitational-wave transients or 
glitches, as transient events or, simply, as events. 

We characterize a transient event in either class by infor- 
mation from the detector's auxiliary channels. Importantly, 
we record the same information for both classes of events. 
Each channel records a time-series measuring some non- 
gravitational-wave degree of freedom, either in the detector 
or its environment. We first reduce the time-series to a set of 
non-Gaussian transients using the Kleine Welle analysis algo- 
rithm 1 17 1, which finds clusters of excess signal power in the 
dyadic wavelet domain. The detected transients are ranked 
by their statistical significance, defined as the negative loga- 
rithm of the probability that a random cluster of wavelet co- 
efficients subject to Gaussian noise would contain the same 
or greater signal power. The MLA classifiers use the proper- 
ties of auxiliary channel transients coincident in time with the 
gravitational-wave channel event to classify the gravitational- 
wave event. 

Given an event in the gravitational-wave channel at time t, 
we build a feature vector x out of the nearby auxiliary channel 
transients. Each channel contributes five features: 

• p: The significance of the single loudest transient in that 
auxiliary channel within ±100 ms of t. 

• At: The difference between t and the central time cor- 
responding to the auxiliary channel transient. 

• d: The duration of the auxiliary channel transient. 

• /: The central frequency of the auxiliary channel tran- 
sient. 

• n: The number of wavelet coefficients clustered to 
form the auxiliary channel transient (a measure of time- 
frequency volume). 

We require all auxiliary transients to have a significance of at 
least 15, and if no such auxiliary transient is found within 100 
ms of t, the five fields for that channel are set to zero. The 
significance threshold of 15 and 100 ms window are tunable 
parameters. The 100 ms window covers most transient cou- 
pling timescales identified by previous studies |[T3l . However, 
there is no guarantee that this window is an optimal choice, or 
that it should be the same for all auxiliary channels. In total, 
we analyze 250 (162) auxiliary channels from S6 (S4) data, re- 
sulting in 1250 (810) dimensions for the auxiliary feature vec- 
tor, X. In addition, we record certain bookkeeping information 
about the original gravitational-wave channel event, the state 
of nearby non-Gaussian transients in the gravitational-wave 
channel, and other information about data quality. These val- 
ues are stripped before classifier training and evaluation so 
that we train the classifiers on only information contained in 
the auxiliary features. 

The set of "glitch" (Class 1) samples, {a;}i, is gener- 
ated by running Kleine Welle over the gravitational-wave 



channel from one of the LIGO detectors. This set of non- 
Gaussian transients from the gravitational-wave channel can, 
in principle, contain true gravitational-waves. However, with- 
out requiring coincident transients in another detector, they 
are overwhelmingly dominated by noise artifacts. Even for 
the most sensitive data set (S6), the expected rate of de- 
tectable gravitational-wave transients from known astrophys- 
ical sources is extremely low (~'10^^ Hz [51) with respect to 
the rate of single-detector noise transients (~0.1 Hz). For the 
advanced LIGO detectors, it may be appropriate to remove co- 
incident gravitational-wave candidates from the glitch training 
samples to avoid contamination from detectable gravitational- 
wave events. In both classifiers' training and performance 
evaluation, we treat all Kleine Welle transients from the 
gravitational-wave channel as artifacts. In total, we identify 
2832 (16,204) noise transients above a nominal significance 
threshold of 35 from the Livingston LI (Hanford HI) detector 
during one week of the S6 (four weeks of the S4) science run. 
The central time from each event is used to trigger feature vec- 
tor generation, so that {a;}i is a set of 2832 (16,204) sample 
vectors, each described by 1250 (810) features derived from 
coincident auxiliary channel information. The samples are 
most representative of the background in gravitational-wave 
burst searches which generally target short, unmodeled sig- 
nals. 

"Clean" (Class 0) samples, {x}q, are formed by first gen- 
erating 10^ randomly distributed times within the data from 
each detector, producing a Poisson distribution mimicking (at 
a greatly exaggerated rate) that which would come from a set 
of true gravitational-wave signals (ignoring effects like detec- 
tor sensitivity). It can also be seen as a sampling of times 
when there was no glitch in the gravitational-wave channel, 
and will sample the typical auxiliary transients that do not 
produce a gravitational-wave glitch. To further aid in distin- 
guishing times when there is no disturbance, we exclude Class 
samples which fall within ±100 ms of a Class 1 sample. As 
with Class 1, the full set of Class samples {a;}o is built from 
auxiliary channel information nearby each randomly selected 
time. 



III. GENERAL FORMULATION OF THE DETECTION 
PROBLEM 

The data analysis problem which we address here can be 
formulated as the robust identification of transient artifacts 
(glitches) in the gravitational-wave channel based on the in- 
formation contained in auxiliary detector channels. Clearly, 
the solution to this problem is directly related to the solution 
to the ultimate problem of robust detection and classification 
of gravitational-wave transients in LIGO data. The identifi- 
cation of glitches will reduce the non-Gaussian background 
and improve the sensitivity of gravitational-wave searches. We 
leave the question of how the results of our current analysis of 
the auxiliary channels can be incorporated into the search for 
transient gravitational waves to future work. 

For a given transient event in the gravitational-wave chan- 
nel, we construct a feature vector of auxiliary information, x, 
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following the procedure outlined in Section |ll] Our detection 
problem reduces to binary prediction on whether this transient 
is a glitch (Class 1) or a clean sample (Class 0) based on x and 
only X. In feature-space, x e Vd, this binary decision can be 
mapped into identifying domains for Class 1 events, Vi, and 
Class events, Vq. We call the surface separating the two 
regions the decision surface. Unless the two classes are per- 
fectly separable, which is typically not the case, there is a non- 
zero probability for an event of one class to occur in a domain 
identified with a different class. In this case, one would like 
to find an optimal decision surface separating two classes in 
such a way that we maximize the probability of finding events 
of Class 1 in Vi at fixed probability of miscatagorizing events 
from Class in Vi. This essentially minimizes the probabil- 
ity of incorrectly classifying events. The former probability. 
Pi, represents the probability of glitch detection which we 
also call detection efficiency, and the latter probability, Pq, is 
called the/fl/ie alarm probability. This optimization principle 
is often referred to as the Ney man-Pearson criteria 1 18|. 

The probability of detection and the probability of false 
alarm can be expressed in terms of conditional probability 
density functions for the feature vector, x: 

Pi= f e{fix)-F*)pix\l)pil)dx, (la) 



Po^ eif{x)~F*)p{x\0)p{0)dx. (lb) 

Herep(a; 1 1) andp{x \ 0) define probabihty density functions 
for the feature vector in the presence and absence of a glitch in 
gravitational-wave data, respectively. p{l) and p(0) are prior 
probabilities for having a glitch or clean data, related to one 
another via p(l) + p{0) = 1.0 {f{x) — F*) defines the re- 
gion Vi which signifies a glitch in the gravitational-wave data, 
and f{x) — F* defines the decision surface. F* is a thresh- 
old parameter, which corresponds to a specific value of the 
probability of false alarm through i 



lb I 



The optimal decision surface is found by maximizing the 
functional 



S[f{x)]=P,[f{x)]-lo {Po[fix)]-P*) 



(2) 



where Pq is a tolerable value for the probability of false alarm 
and ^0 is a Lagrange multiplier Setting the variation of this 
functional with respect to f{x) to zero leads to the condition 
for the points on the decision surface 



P{x I l)p(l) 
pix I O)p(O) 



CONSTANT . 



The ratio of conditional probability density functions, 

p{x I 1) 



p{x I 0) 



(3) 



(4) 



is called the Ukelihood ratio (sometimes also referred to as 
the Bayes factor). The CONSTANT in the optimality condition 
(j3]l does not carry any special meaning, and the condition can 



be satisfied if the decision surface is defined as the surface of 
constant likelihood ratio IJ9J . 



fix) = A{x) = F* 



(5) 



with F* set by the probability of false alarm, Pq, through 
( fTb] ). Actually, the decision surface can be defined by any 
monotonic function of the likelihood ratio with trivial redefi- 
nition of F* . There is a unique decision surface for each value 
of Pq S [0, 1], and we can label decision surfaces by their cor- 
responding values of Pq . 

The optimization of (|2| maximizes the detection probabil- 
ity. Pi — > Pi''^, for every value of the probability of false 
alarm, Pq = Pq*. The curve Pi°"(Po) is called the Re- 
ceiver Operating Characteristic (ROC) curve. It is a standard 
measure of any detection algorithm's performance. We can 
think of optimizing Q as maximizing the area under the ROC 
curve. For further details on use of the likelihood ratio in the 
gravitational-wave searches, see 



Finding the optimal decision surfaces by direct estimation 
of the conditional probability density functions, p{x \ 1) and 
p{x I 0), is an extremely difficult task if the feature vector con- 
tains more than a few dimensions. For high-dimensional prob- 
lems, when no parametric model for these probability distri- 
butions is known and with a limited number of experimental 
samples that can be used to estimate these probability density 
functions, one has to resort to some other method. MLA are 
well-suited for these detection problems. 

In this paper, we consider three popular MLAs: ANN, 
SVM and RF. They differ significantly in their underlying 
algorithms and their approaches to classification. This allows 
us to investigate the applicability of different types of MLA to 
glitch identification in the gravitational-wave data. However, 
all MLA require training samples of events from both Class 
1 and Class 0. The MLA classifiers use the training sets to 
find an optimal classification scheme or decision surface. In 
the limit of infinitely many samples and unlimited computa- 
tional resources, different classifiers should recover the same 
theoretical result, the decision surface defined by the constant 
likelihood ratio (|5]l. To this end, it is critical that classifiers 
are trained and optimized using criteria consistent with this 
result. In Appendix [A] we explore several standard optimiza- 
tion criteria and derive the decision surfaces they generate in 
this theoretical limit. We find that all of these criteria lead to a 
decision surface with constant likelihood ratio. In particular, 
this is true for the fraction of correctly classified events and 
the Gini index criteria that ai-e used by ANN /SVM and RF, 
respectively. 

While all classifiers we investigate here should find the 
same optimal solution with sufficient data, in practice, the al- 
gorithms are limited by the finite number of samples in the 
training sets and by computational cost. The classifiers have 
to handle a large number of dimensions efficiently, many of 
which might be redundant or irrelevant. By no means it is 
clear that the MLA classifiers will perform well under such 
conditions. It is our goal to demonstrate that they do. 

We evaluate their performance by computing ROC curves. 
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These curves, which map the classifiers' overall efficiencies, 
are objective and can be directly compared. In addition to 
comparing the MLA classifiers to each other, we benchmark 
them using ROC curves from the OVL algorithm |21 1. This 
method constructs segments of data to be vetoed using a hard 
time window and a threshold on the significance of transients 
in the auxiliary channels. The veto segments are constructed 
separately for different auxiliary channels and are applied 
in the order of decreasing correlation with the gravitational- 
wave events. By construction, only pairwise correlations be- 
tween a single auxiliary channel and the gravitational-wave 
channel are considered by the OVL algorithm. These results 
have a straightforward interpretation and provide a good san- 
ity check. 

In order to make the classifier comparison as fair as possi- 
ble, we train and evaluate their performances using exactly the 
same data. Furthermore, we use a round-robin procedure for 
the training-evaluation cycle, which allows us to use all avail- 
able glitch and clean samples. Samples are randomized and 
separated into ten equal subsets. To classify events in the A;*'* 
subset, we use classifiers trained on all but the fc*'' subset. In 
this way, we ensure that training and evaluation are done with 
disjoint sets so that any over-training that might occur does 
not bias our results. 

An MLA classifier's output is called a rank, tmla G [0,1], 
and a separate rank is assigned to each glitch and clean sam- 
ple. Higher ranks generally denote a higher confidence that 
the event is a glitch. A threshold on this rank maps to the 
probability of false alarm, Pq, by computing the fraction of 
clean samples with greater or equal rank. Similarly, the prob- 
ability of detection or efficiency. Pi, is estimated by comput- 
ing fraction of glitches with greater or equal rank. Essentially, 
we parametrically define the ROC curve, Pi^^{Po), with a 
threshold on the classifier's rank. Synchronous training and 
evaluation of the classifiers allow us to perform a fair compar- 
ison and to investigate various ways of combining the outputs 
of different classifiers. We discuss our findings in detail in 
Sections [VllandlVlll 



IV. OVERVIEW OF THE MACHINE LEARNING 
ALGORITHMS 

In this section, we give a short overview of the basic prop- 
erties of the classifiers and the tuning procedures used to de- 
termine the best performing configurations for each classifier 



More traditional veto approaches to data quality in gravitational-wave 
searches use another measure of veto quality. For a given veto configu- 
ration consisting of a list of disjoint segments of data, the fractional "dead- 
time" is computed from the sum of the durations of all data segments to be 
vetoed. While not precisely the same, this quantity is related to the proba- 
bility of false alarm, Pq ■ which accounts only for the fraction of clean data 
removed from the search. For a typical rate of glitches of ~0. 1 Hz, the two 
measures are almost identical in the most relevant region of Pq < 0.01. 
Thus, in that interval the ROC curves of this paper can be directly com- 
pared to the often used figure of merit, efficiency vs. fractional dead-time. 
See for example L13J . 



Throughout this section, we will use the notation Xi where 
i — 1,2, ...N to denote the set of N sample feature vectors. 
Similarly, will denote the actual class associated with the 
the z'^ sample feature vector, either Class or Class 1 . Predic- 
tions about a feature vector's class will be denoted by y{xi). 

A. Artiflcial Neural Network 

An Artificial Neural Network is a machine learning tech- 
nique based on simulating the data processing in human brains 
and mimicking biological neural networks [i22l [23l . As is 
well-known, the human brain is composed of a tremendous 
number of interconnected neurons, with each cell perform- 
ing a simple task (responding to an input stimulus). How- 
ever, when a large number of neurons form a complicated 
network structure, they can perform complex tasks such as 
speech recognition and decision-making. 

A single neuron is composed of dendrites, a cell body, and 
an axon. When dendrites receive an external stimulus from 
other neurons, the cell body computes the signal. When the to- 
tal strength of the stimulus is greater than the synapse thresh- 
old, the neuron is fired and sends an electrochemical signal to 
other neurons through the axon. This process can be imple- 
mented with a simple mathematical model including nodes, 
a network topology and learning rules adopted to a specific 
data processing task. Nodes are characterized by their num- 
ber of inputs and outputs (essentially how many other nodes 
they talk to), and by the connecting weights associated with 
each input and output. The network topology is defined by 
the connections between the neurons (nodes). The learning 
rules prescribe how the connecting weights are initialized and 
evolve. 

There are a large number of ANN models with different 
topologies. For our purpose, we choose to implement the 
multi-layered perceptron (MLP) model, which is one of the 
most widely used models. The MLP model has input and out- 
put layers as well as a few hidden layers in between. The in- 
put vector for the input layer is the auxiliary feature vector, x, 
while the input for hidden layers and the output layer is a com- 
bination of the output from nodes in the previous layer We 
will call these intermediate vectors z to distinguish them from 
the full feature vector. Each layer has several neurons which 
are connected to the neurons in the adjacent layers with indi- 
vidual connecting weights. The initial structure - the number 
of layers, neurons, and the initial value of connecting weights 
- is chosen by hand and /or through an optimization scheme 
such as a genetic algorithm (GA). 

When a neuron's input channels receive an external signal 
exceeding the threshold value set by an activation function, the 
neuron is fired. This process can be expressed mathematically 
as: 

y(z) ^f{wz + b), (6) 

where y{z) is the output, z is an input vector, w are connect- 
ing weights, / is an activation function, and 6 is a bias. One 
may choose the activation function to be either the identity 



6 



function, the ramp function, the step function, or a sigmoid 
function. We use the sigmoid function defined by 

+ = (l + e-^"('""^+''))"\ (7) 

We set the activation steepness s = 0.5 in hidden layers 
and s = 0.9 at the output layer There is a single neuron at the 
output layer, and the value of that neuron's activation function 
is used as ANN's rank, tann- 

The topological parameters determine the number of con- 
nection weights, which must be sufficiently large so that ANN 
has enough degrees of freedom to classify a given datum. The 
network's flexibility depends on the number of connection 
weights and should be matched with the size of the training 
sets and the input data's dimensionality. In our work, the num- 
bers of layers and neurons are chosen so that the total number 
of connection weights is on the order of 10^ when using the 
entire data set. In order to avoid overtraining, we decrease the 
number of layers and neurons in the runs in which either the 
dimensionality or the number of training samples are reduced. 

The learning scheme finds the optimal connection weights, 
w, in each layer In this paper we use the improved version 
of the Resilient back PROPagation (iRPROP) algorithm |,24J , 
which minimizes the error between the output value, y{xi), 
and the known value, yi. In this algorithm, the increase (de- 
crease) factor and the minimum (maximum) step-size deter- 
mine the change in the connection weights. Aw, at each itera- 
tion during the training. In our work, the default values in the 
Fast Artificial Neural Network (FANN) library |25 1 are used 
for all parameters except the increase factor, which is set to 
77+ = 1.001. The same learning rules were used in all runs. 
We should note that ANN can be optimized in an alternative 
way, via a GA or other similar methods. When using a GA, 
a combined optimization algorithm for topology, feature and 
weight selection can be applied to improve the performance 
of ANN Il26l430l . We explore these options in a separate pub- 
lication Ell- 
in addition to the choosing the ANN configuration param- 
eters, we found that ANN requires data pre-processing. In 
ANN, higher absolute values of the input variables have more 
effect on the output values, so all components of the feature 
vector, X, are normalized to the range [0, 1]. To better resolve 
small Ai values. At is transformed via a logarithmic function 
before normalization. 

At' = -sign(At) log |At|. (8) 

This transformation improves ANN's ability to identify 
glitches, which tend to have smaller values of At. You can 
find a more detailed description of the procedure for tuning 
the ANN configuration parameters in [1311 . 

B. Support Vector Machines 

A Support Vector Machine is a machine learning algorithm 
for binary classification on a vector space lf32l[33l . It finds the 



optimal hyperplane that separates the two classes of training 
samples. This hyperplane is then used as the decision surface 
in feature space, and classifies events depending on which side 
of the hyperplane they fall. 

As before, let {{xi^yi) \i — 1,2, ...N} be the training data 
set, where Xi is the feature vector of auxiliary transient infor- 
mation near time t^, and yi E {1, —1} is a label that marks the 
sample as either Class 1 or Class 0, respectively. Then assume 
that the training set is separable by a hyperplane w-x — b — 0, 
where w is the normal vector to the hyperplane and b is the 
bias. Then the training samples with yi — I satisfy the condi- 
tion w ■ Xi — b > 1, and the training samples with = — 1 sat- 
isfy the condition w ■ Xi — b < —1. SVM uses a quadratic pro- 
gramming method to find the w and b that maximize the mar- 
gin between the hyperplanes w-x — b— 1 and w-x — b — —1. 

If the training samples are not separable in the original fea- 
ture space, Vd, SVM uses a mapping, <j){x), into a higher di- 
mensional vector space, V^, in which two classes of events 
can be separated. The decision hyperplane in corresponds 
to a non-linear surface in the original space, Vd. Thus, map- 
ping the problem into a higher dimensional space allows SVM 
to consider non-linear decision surfaces. The dimensionality 
of grows exponentially with degree of the non-linearity of 
the decision surfaces in Vd. As a result, SVM can not con- 
sider arbitrary decision surfaces and usually has to deal with 
non-separable populations. If the training samples are not sep- 
arable after mapping, a penalty parameter, C, is introduced to 
weight the training error Finding the optimal hyperplane is 
reduced to the following quadratic programming problem: 

min dw-w + Cy^^i] , (9a) 

subject to yi ■ {w ■ (j>{xi) + b) > 1 - £,i , (9b) 
e;>0, 1 = 1,2,..., iV. (9c) 

When the solution is found, SVM classifies a sample x by the 
decision function: 



y{xi) = sign {w ■ 0(a;^) + b) . (10) 

In solving the quadratic programming problem, the func- 
tion is not explicitly needed. It is sufficient to specify 
(t){xi)-(f){xj). The function if (a; i, a; j) — 0(a:i)-(/)(a;j) is called 
the kernel function. The form of the kernel function implicitly 
defines the family of surfaces in Vd over which SVM is opti- 
mizing. In this study we use the Radial Basis function (RBF) 
as the kernel function. It is defined as 

K{xi,Xj)=ex]i{-"f\\xi-Xj\\^] , (11) 

where 7 is a tunable parameter. 

The SVM algorithm was implemented by using the open- 
source package libsvm [34|. As part of this package, the 
kernel function parameter, 7, and the penalty parameter, C, 
are tuned in order to achieve the best performance for a spe- 
cific application. The best parameters (C and 7) are selected 
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through an exhaustive search. For each pair of parameters 
(log C, log 7) on a grid, we calculate a figure of merit. The 
parameters with the best figure of merit are then used for clas- 
sifying events. The default figure of merit in libsvm is the 
accuracy (fraction of correctly classified events). However, 
we replace it with a figure of merit better adapted to glitch de- 
tection. Instead of using the accuracy, our code calculates the 
area under the estimated ROC curve (-P{'^^(Po)) in the interval 
of the probability of false alarm Pq e [0.001, 0.05] on a log- 
linear scale ([0.001, 0.05] is a range of acceptable probability 
of false alarm for practical glitch detection). 



Po=0.05 



figure of merit = 



d(lnPo)Pr(Po) (12) 



Po=0.001 

Performing an exhaustive search for the best SVM parame- 
ters is computationally expensive. We can speed up this tuning 
process by exploiting the fact that the tuning time grows non- 
linearly with the training sample size. By using smaller train- 
ing sets, we can reduce the total time spent determining the 
optimal parameters. We randomly selected p subsets of vec- 
tors from the training set, with each subset 10 times smaller 
than the full training set. The best pair of the SVM parame- 
ters for each of the p subsets was then calculated, with each 
subset optimization running on a single CPU core. This gives 
p sets of best parameters, calculated in parallel. The parame- 
ters C and 7 that are selected the most often were then cho- 
sen as the final best SVM parameters. This modified parame- 
ter optimization algorithm was applied to various training sets 
(described in Section|V]l. We found that the optimal SVM pa- 
rameters do not depend on the training set. We, therefore, use 
the same parameters for all calculations reported in this paper 
(C = 8and7 = 0.0078125). 

In its standard configuration, SVM classifies samples by a 
discrete label, y e {1,-1}. However, the libsvm package can 



provide a probability based version of ( 10 1 that yields con- 
tinuous values, rsvM G [0, 1] fi5\. We use these continuous 
values as the output of the SVM classifier. 



C. Random Forest Technology 

Random forest technology 1361 l37l improves upon the clas- 
sical decision tree |38 39l approach to classification. The 
classifying decision tree performs a series of binary splits on 
any /all of the dimensions of the feature vector, x, that de- 
scribes an event. The goal is to distribute events into groups 
consisting of only a single class. In a machine-learning con- 
text, the decision tree is formed by "training" it on a set of 
events of known class. A series of splits are made, where each 
split chooses the dimension and threshold that optimizes a cer- 
tain criterion, such as the fraction of correctly classified train- 
ing events or the Gini index. Splitting stops once no split can 
further improve the optimization criterion or the limit on the 
minimum number of events allowed on a branch (the furthest 
reaches of a decision tree) is reached; at this point the branch 
becomes a leaf. Once a tree is formed, an event of unknown 



class is fed into the tree, and depending on its feature vector, 
X, it will be labeled as either Class or Class 1. However, a 
single decision tree can be a victim to both false minima and 
over- training. To guard against this, we create a forest of de- 
cision trees and average over their answers; this results in a 
continuous ranking, trf G [0, 1], rather than a binary classifi- 
cation, as events can be placed on a continuum between Class 
and Class 1 . 

Each decision tree in the forest is trained on a bootstrap 
replica of the original training set. If the original training set 
has N events, each bootstrap replica will also have N events, 
which are chosen randomly with replacement - so that any 
given event may be picked more than once. Therefore, each 
tree gets a different set of training events. To further avoid 
false minima, random forest technology chooses a different 
random subset of the features to be available for splitting at 
each node. This ensures that a peculiarity in a particular di- 
mension does not dominate the decision making process. 

We use the StatPatternRecognition software package's |40| 
implementation of RF. The key input parameters are the num- 
ber of trees in the forest, the number of features randomly 
selected for splitting at each tree node (branching point), 
the minimum number of samples on the terminal tree nodes 
(leaves) and the optimization criterion. To determine the best 
set of the RF parameters, we perform a search over a coarse 
grid in the parameter space, maximizing efficiency or the 
detection probability. Pi, at the probability of false alarm, 
Pq = 0.01. We find that beyond a certain point, the RF ef- 
ficiency grows very slowly with the number of trees and the 
number of features selected for splitting at the cost of a sig- 
nificant increase in running time during the training process. 
Taking this into account, we arrive at the following configu- 
ration which we use in all runs: 100 trees in the forest, 64 
features for splitting, a minimum of 8 samples on a node, and 
the Gini index as the optimization criterion. 



D. Ordered Veto List Algorithm 

The OVL algorithm operates by looking for coincidences 
between the transients in gravitational-wave and auxiliary 
channels. Specifically, the transients identified in the auxiliary 
channel are used to construct a list of time segments. All tran- 
sients in the gravitational-wave channel occurring within these 
segments are removed from the list of transient gravitational- 
wave candidates. In effect, the data in these time segments are 
vetoed prior to any search for gravitational- waves. 

The algorithm applies a hierarchical approach, assuming 
that transients in certain auxiliary channels are more cor- 
related with the glitches in gravitational-wave channel, and 
looks for a hierarchy of different types of correlations between 
auxiliary and gravitaitonal-wave glitches. Specifically, a se- 
ries of veto configurations is created, corresponding to dif- 
ferent auxiliary channels, the time windows around transients 
and the threshold on their significance. The ordered list cor- 
responds to a list of these configurations, and veto configura- 
tions are applied to the data in order of decreasing correlation. 
For this study, the maximum time window is set to ± 100 ms 
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to match the one we use to create auxiliary feature vectors for 
the MLA classifiers (Section|ll]i. Similarly, the lowest thresh- 
old on significance, p, is set to the auxiliary channel nominal 
threshold of 15. For each channel, the number of possible veto 
configurations is equal to the number of unique combinations 
that can be constructed from the list of the time windows, [± 
25 ms, ± 50 ms, ±100 ms], and the significance thresholds, 
[15, 25, 30, 50, 100, 200, 400, 800, 1600]. 

Importantly, a segment removed by a veto configuration is 
not seen by later configurations. This prohibits duplicate ve- 
toes and results in a measurement of the additional informa- 
tion contained in subsequent veto configurations. The per- 
formance of each configuration is evaluated and they are re- 
ranked accordingly. The OVL algorithm defines the veto- 
configuration rank, tovl, as the ratio of the fraction of 
gravitational-wave glitches removed to the fraction of analy- 
sis time removed. Repeated application of the algorithm pro- 
duces an ordered list with the best performing configurations 
appearing first. 

Only some of the veto configurations make it to the final 
list. Those which perform poorly (rovL < 3) are discarded. 
This is done in order to get rid of irrelevant or redundant chan- 
nels and to speed up the algorithm's convergence. Typically, 
the OVL algorithm converges within less than 10 iterations to 
a final ordered list. We find that only 47 out of 162 auxiliary 
channels in S4 data and 35 out of 250 auxiliary channels in 
S6 data appear on the final list. Below, we refer to this sub- 
set of channels as the "OVL auxiliary channels." For a more 
detailed description of the OVL algorithm, see |21 1. 

The procedure for optimizing the ordered list of veto con- 
figurations can be considered as training phase. An ordered 
list of veto configurations, optimized for a given segment of 
data, can be applied to another segment of data. Veto segments 
are generated based on the transients in the auxiliary channels 
and the list of configurations. Performance of the algorithm 
is evaluated by counting fractions of removed glitches and 
clean samples, and computing the ROC curve. Similarly to 
MLA classifiers we use the round-robin procedure for OVL's 
training-evaluation cycle. 



V. TESTING THE ALGORITHMS' ROBUSTNESS 

One of the main goals of this study is to establish if MLA 
methods can successfully identify transient instrumental and 
environmental artifacts in LIGO gravitational-wave data. The 
potential difficulty arises from the high dimensionality and 
the fact that information from a large number of dimensions 
might be either redundant or irrelevant. Furthermore, the ori- 
gin of a large fraction of glitches is unknown in the sense that 
their cause and effect have not been pinpointed to a single in- 
strumental or environmental source. In the absence of such 
deterministic knowledge, one has to monitor a large num- 
ber of auxiliary channels and look for statistically significant 
correlations between transients in these channels and tran- 
sients in the gravitational-wave channel. These correlations, 
in principle, may involve more than one auxiliary channel 
and may depend on the transients' parameters in an extremely 



complicated way. Additionally, new kinds of artifacts may 
arise if one of the detector subsystems begin to malfunction. 
Likewise, some auxiliary channels' coupling strengths to the 
gravitational-wave channel may be functions of the detector's 
state (e.g. optical cavity configuration and mirror alignment). 
Depending on the detector's state, the same disturbance wit- 
nessed by an auxiliary channel may or may not cause a glitch 
in the gravitational-wave channel. This information can not be 
captured by the Kleine Welle-derived parameters of the tran- 
sients in the auxiliary channels alone and requires extending 
the current method. We leave this problem to future work. 

Because of the uncertainty in the types and locations of 
correlations, we include as many auxiliary channels and their 
transients' parameters as possible. However, this forces us 
to handle a large number of features, many of which might 
be either redundant or irrelevant. The MLA classifiers may 
be confused by the presence of these superfluous features and 
their performance may suffer One can improve performance 
by reducing the number of features and keeping only those 
that are statistically significant. However, this requires pre- 
processing the input data and tuning, which may be extremely 
labor intensive. On the other hand, if the MLA classifier can 
ignore irrelevant dimensions automatically without a signifi- 
cant decrease in performance, it can be used as a robust anal- 
ysis tool for real-time glitch identification and detector char- 
acterization. By efficiently processing information from all 
auxiliary channels, a classifier will be able to identify new ar- 
tifacts and help to diagnose problems with the detector 

In order to determine our classifiers' robustness, we per- 
form a series of runs in which we vary the dimensionality of 
the input data and evaluate the classifiers' performance. First, 
we investigate how their efficiency depends on which tran- 
sient parameters are used. We expect that not all of the five 
parameters, (p. At, /, d, n), are equally informative. Naively, 
p and At, reflecting the disturbance's amplitude in the auxil- 
iary channel and its degree of coincidence with the transient in 
gravitational-wave channel, respectively, should be the most 
informative. Potentially, the frequency, /, duration, d, and 
the number of wavelet coefficients, n, may carry useful infor- 
mation if only certain auxiliary transients produce glitches. 
However, it is more likely that these parameters are corre- 
lated with the corresponding parameters of gravitational-wave 
transient, which we do not incorporate in this analysis. Such 
correlations, even if not broadened by frequency dependent 
transfer functions, would require analysis specialized to spe- 
cific gravitational-wave signals and goes beyond the scope of 
this paper We perform a generic analysis, not relying on the 
specific characteristics of the gravitational-wave transients. 

Anticipating that some of the parameters could be irrele- 
vant, we prepare several data sets by removing features from 
the list: {p, At, f, d, n). We prepare these data sets for 
both S4 and S6 data and run each of the classifiers through 
the training-evaluation round-robin cycles described in Sec- 
tion III We evaluate their performance by computing the 



ROC curves, shown in Figure [T] 

We note the following relative trends in the ROC curves 
for all classifiers. The omission of the transient's duration, 
d, and the number of wavelets, n, has virtually no effect on 
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False Alarm Probability False Alarm Probability False Alarm Probability 

(d) S6 ANN (e) S6 SVM (f) S6 RF 

FIG. 1. Varying sample features. We expect some of the five features recorded for each auxiliary channel to be more useful than others. To 
quantitatively demonstrate this, we train and evaluate our classifiers using subsets of our sample data, with each subset restricting the number 
of auxiliary features. We observe the general trend that the significance, p, and time difference. At, are the two most important features. 
Between those two, p appears to be marginally more important than At. On the other hand, the central frequency, /, the duration, d, and the 
number of wavelet coefficients in an event, n, all appear to have very little affect on the classifier's performance. Importantly, our classifiers 
are not impaired by the presence of these superfluous features and appear to robustly reject irrelevant data without significant efficiency loss. 
The black dashed line represents a classifiers based on random choice. 



efficiency. The ROC curves are tlie same to within our er- 
ror, which is less than ± 1 % for our efficiency measure- 
ment (This is based on the total number of glitch samples and 
the normal approximation for binomial confidence interval, 
Pi{l — Pi)/N). Omission of the frequency, /, slightly re- 
duces the efficiency of SVM (Figure lb and Figure lei, but 
has no effect on either ANN or RF. A comparison between 
the ROC curves for (p, At), (p) and (At) data sets shows 
that while a transient's significance is the most informative pa- 
rameter, including the time difference generally results in bet- 
ter overall performance. Of the three MLA classifiers, SVM 
seems to be the most sensitive to whether the time difference 
is used in addition to significance. RF, as it appears, relies pri- 
marily on significance, which is reflected in poor performance 
of the (Ai)-only ROC curves in Figure Ic and FigurefT^ The 



trend for ANN is not as clear In S4 data, including timing 



does not change the ROC curve (Figure lai while in S6 data 
it improves it (Figure [Td| . Overall, we conclude that based 
on these tests, most if not all the information about detected 
glitches is contained in p and At pair At the same time, keep- 
ing irrelevant features does not seem to have a negative effect 
on our classifiers' performance. 



The OVL algorithm, which we use as a benchmark, ranks 
and orders the auxiliary channels based on the strength of cor- 
relations between transient disturbances in the auxiliary chan- 
nels and glitches in gravitational-wave channel. The final list 
of OVL channels includes only a small subset of the available 
auxiliary channels, 47 (of 162) in S4 data and 35 (of 250) in S6 
data. The rest of the channels do not show statistically signifi- 
cant correlations. It is possible that these channels contain no 
useful information for glitch identification, or that one has to 
include correlations involving multiple channels and/or other 
features. In the former case, throwing out irrelevant channels 
will significantly decrease our problem's dimensionality and 
may improve the classifiers' efficiency. In the latter case, clas- 
sifiers might be capable of using higher order correlations to 
identify new classes of glitches missed by OVL. 

We prepare two sets of data to investigate these possibili- 
ties. In the first data set, we use only the OVL auxiliary chan- 
nels and exclude information from all other channels. In the 
second data set, we further reduce the number of dimensions 
by using only p and At. We apply classifiers to both data sets, 
evaluate their performance and compare it to the run over the 
full data set (all channels and all features). Figure|2]shows the 
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ROC curves computed for these test runs. 

In both S4 and S6 data, the three curves for RF (Figure |2c] 
and Figure[2^ lay on the top of each other, demonstrating that 
the classifier's performance is not affected by the data reduc- 
tion. ANN shows slight improvement in its performance for 
the maximally reduced data set in the S6 data (Figure[2d|, and 
no discernible change in the S4 data (Figure 2a i. SVM ex- 



the round-robin procedure. Figure |3] demonstrates changes in 
the ROC curves due to the variation of training sets. 

RF performance (Figure 3c and FigurelSB is not affected by 



hibits the most variation of the three classifiers. While drop- 
ping the auxiliary channels not included in the OVL Ust has a 
very small effect on S VM's ROC curve, further data reduction 



leads to an efficiency loss (Figure [2b] and Figure 2ei. Viewed 
together, the plots in Figure |2] imply that, on one hand, non- 
OVL channels can be safely dropped from the analysis, but 
on the other hand, the presence of these uninformative chan- 
nels does not reduce our classifiers' efficiency. This is reas- 
suring. As previously mentioned, one would like to use these 
methods for real-time classification and detector diagnosis, in 
which case monitoring as many channels as possible allows us 
to identify new kinds of glitches and potential detector mal- 
functions. For example, an auxiUary channel that previously 
showed no sign of a problem may begin to witness glitches. If 
excluded from the analysis based on its previous irrelevance, 
the classifiers would not be able to identify glitches witnessed 
by this channel or warn of a problem. 

Another way in which input data may influence a classi- 
fier's performance is by limiting the number of samples in 
the training set. Theoretically, the larger the training sets, 
the more accurate a classifier's prediction. However, larger 
training sets come with a much higher computational cost and 
longer training times. In our case, the size of the glitch train- 
ing set is limited by the glitch rate in the gravitational-wave 
channel and the duration of the detector's run. We remind the 
reader that we use four weeks from the S4 run from the HI 
detector and one week from the S6 run from the LI detec- 
tor to collect glitch samples. One would like to use shorter 
segments to better capture non-stationarity of the detector's 
behavior. However, having too few glitch samples would not 
provide a classifier with enough information. Ultimately, the 
size of the glitch training set will have to be tuned based on 
the detector's behavior. We have much more control over the 
size of the clean training set, which is based on completely 
random times when the detector was operating in the science 
mode. In our simulations, we start with 10^ clean samples, 
but it might be possible to reduce this number without loss of 
efficiency, thereby speeding up classifier training. 

We test how the classifiers' performance is affected by the 
size of the clean training set in a series of runs in which we 
gradually reduce the number of clean samples available. Runs 
with 100%, 75%, 50% and 25% of the total number of clean 
samples available for training are supplemented by a run in 
which the number of clean training samples is equal to the 
number of glitch training samples (16% in S4 data and 2.5% 
in S6 data). In addition, we performed one run in which we re- 
duced the number of glitch training samples by half, but kept 
100% of the clean training samples. While not completely 
exhaustive, we believe these runs provide us with enough in- 
formation to describe the classifiers' behavior. In all of these 
runs, we use all available samples for evaluation, employing 



reduction of the clean training set in the explored range, with 
the only exception of the run over S6 data where size of the 
clean training set is to 2.5% of the original. In this case, the 
ROC curve shows an efficiency loss on the order of 5% at a 
false alarm probability of Pq — 10^^. Also, cutting the glitch 
training set by half does not affect RF efficiency in either S4 
or S6 data. SVM's performance follows very similar trends. 



shown in Figure 3b and Figure 3e It demonstrates robust per- 
formance against the reduction of the clean training set, suf- 
fering appreciable loss of efficiency only in the case of the 
smallest, 2.5% set. Unlike RF, SVM seems to be more sensi- 
tive to variations in size of glitch training set. The ROC curve 
for the 50% ghtch set in S6 data drops 5%-10% in the false 
alarm probability region of Pq — 10^'^ (Figure 3e i. However, 



this does not happen in the S4 run (Figure 3e i. This can be ex- 
plained by the fact that S4 glitch data set has five times more 
samples than S6 set. Even after cutting it in half, the S4 set 
provides better sampling than the full S6 set. 

ANN is affected most severely by training set reduction 



(Figure 3a and Figure 3d i. First, its overall performance vis- 
ibly degrades with the size of the clean training set, espe- 
cially in the S6 runs (Figure [Sd]). Although, we note that the 
ROC curve primarily drops near a false alarm probability of 
Pq — 10^^, it remains the same near Pq — 10^^ (for all 
but the 2.5% set). The higher Pq value is more important in 
practice because the probability of false alarm of 10~^ is still 
tolerable and, at the same time, the efficiency is significantly 
higher than at Pq = 10"'^. This means that we are likely to 
operate a real-time monitor near Pq — 10^^ rather than near 
10"^. Reducing the training sample introduces an artifact on 
ANN'S ROC curves, not seen on either RF or SVM. Here, 
the false alarm probability's range decreases with the size of 
the clean training set. This is due to the fact that with the 
ANN configuration parameters used in this analysis, ANN's 
rank becomes more degenerate when less clean samples are 
available for training, meaning that multiple clean samples in 
the evaluation set are assigned exactly the same rank. This is 
in general undesirable, because a continuous, non-degenerate 
rank carries more information and can be more efficiently in- 
corporated into gravitational-wave searches. The degeneracy 
issue of ANN and its possible solutions are treated in detail in 

El- 

We would like to highlight the fact that in our test runs, we 
use data from two different detectors and during different sci- 
ence runs, and that we test three very different classifiers. The 
common trends that we observe are not the result of peculiari- 
ties in a specific data set or an algorithm. It is reasonable to ex- 
pect that they reflect generic properties of the detectors' auxil- 
iary data as well as the MLA classifiers. Extrapolating this to 
the future applications in advanced detectors, we find it reas- 
suring that the classifiers, when suitably configured, are able 
to monitor large numbers of auxiliary channels while ignor- 
ing irrelevant channels and features. Furthermore, their per- 
formance is robust against variations in the training set size. 
In the next sections we compare different classifiers in their 



11 



0.0 



All channels and parameters 

Only significant OVL channels 

" Only significant OVL channels and (p, A/) 


1 1 








' 1 






1 






^ 1 

1 









10-5 10-4 ig-3 10-2 10 I 

False Alarm Probability 
(a) S4 ANN 



0.0 



All channels and pai-ameters 

Only significant OVL channels 

Only significant OVL channels and (p. , 




10" 10-' 10-* 10-3 10-2 ,0 1 

False Alarm Probability 
(b) S4 SVM 



0.0 



All channels and pariimeters 

Only significant OVL channels 

" Only significant OVL channels and (p. Ar) 


f ) 








f 1 






: 








^ 1 

1 









10" 10-' lO-** 10--3 10-^ 10-' 10" 
False Alarm Probability 

(c) S4 RF 



0.8 



-0.4 



0.0 
10 



All channels and parameters 

Only significant OVL channels 

Only significant OVL channels and (p, A/) 




All channels and pai-ameters 

Only significant OVL channels 

" Only significant OVL channels and (p, At) 


i 

1 
























! 

1 











m-* 10--' 10-- 10 
False Alarm Probability 

(d) S6 ANN 



10-* 10--' 10 - 10 ' 

False Alarm Probability 
(e) S6 SVM 



43 4 



0.0 



All channels and parameters 

Only significant OVL channels 

" Only significant OVL channels and (p. Ar) 












1 








1 

1 

1 

/ 








/ 

1 













10" 10-' lO-t 10-' 10-2 10-1 iqO 

False Alarm Probability 
(f) S6RF 



FIG. 2. Reducing the number of channels. One way to reduce the dimensionality of our feature space is to reduce the number of auxiliary 
channels used to create the feature vector. We use a subset of auxiliary channels identified by OVL as strongly correlated with glitches in the 
gravitational-wave channel (light blue). We notice that for the most part, there is not much efficiency loss when restricting the feature space 
in this way. This also means that very little information is extracted from the other auxiliary channels. The classifiers can reject extraneous 
channels and features without significant loss or gain of efficiency. We also restrict the feature vector to only include the significance, p, 
and the time difference. At, for the OVL auxiliary channels (green). Again, there is not much efficiency loss, suggesting that these are the 
important features and that the classifiers can robustly reject unimportant features automatically. The black dashed line represents a classifier 
that is based on random choice. 



bulk performance as well as in sample-by-sample predictions 
using the full data sets. 



VI. EVALUATING AND COMPARING CLASSIFIERS' 
PERFORMANCE 

The most relevant measure of any glitch detection algo- 
rithm's performance is its detection efficiency, the fraction of 
identified glitches. Pi, at some probability of false alarm, Pq. 
The ROC curve is the key figure of merit and can be used to 
assess the algorithm's efficiency and objectively compare it to 
other methods. It provides an estimate of the algorithm's ef- 
ficiency throughout the entire range of the probabiUty of false 
alarm. (The upper limit for acceptable values of probability of 
false alarm depends on application.) In the problem of glitch 
detection in gravitational-wave data, we set this value to be 
Pq = 10^^, which corresponds to 1% of true gravitational- 
wave transients falsely labeled as glitches. Another way to 
interpret this is that 1 % of the clean science data are removed 
from searches for gravitational waves. 



Our test runs, described in the previous section, demon- 
strate the robustness of the MLA classifiers against the pres- 
ence of irrelevant features in the input data. We are inter- 
ested in measuring a classifiers' efficiency in the common case 
where no prior information about relevance of the auxiliary 
channels is assumed. For this purpose, we use the full S4 and 
S6 data sets, including all channels with a wide selection of 
parameters. Using exactly the same training/evaluation sets 
for all our classifiers allows us to assign four ranks, (rANN, 
''svM^ ''rf^ '"ovl). one from each of the classifiers, to every 
sample and compute the probability of false alarm, Po(^i) and 
efficiency, Pi{ri). While the ranks can not be compared di- 
rectly, these probabilities can. Any differences in classifiers' 
predictions, in this case, are from the details and limitations 
of the methods themselves, and are not from the training data. 

Glitch samples that are separated in time by less than a sec- 
ond are likely to be caused by the same auxiliary disturbance. 
Even if they are not, gravitational-wave transient candidates 
detected in a search are typically "clustered" with a time win- 
dow ranging from a few hundred milliseconds to a few sec- 
onds, depending on the length of the targeted gravitational- 
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FIG. 3. Varying the size of training data sets. In our sample data, the number of ghtches is hmited by the actual glitch rate in the LIGO 
detectors and the length of the analysis time we use. However, we can construct as many clean samples as necessary because we sample the 
auxiliary channels at random times. In general, classifiers' performance will increase with larger training data sets, but the computational cost 
of large data sets will also increase. We investigate the effect of varying the size of training sets on the classifiers' performance, and observe 
only small changes even when we significantly reduce the number of clean samples. We also reduce the number of glitch samples, and observe 
that the classifiers are more sensitive to the number of glitches. This is likely due to the smaller number of glitch samples, and reducing the 
number of glitches may induce severe undersampling of feature space. The black dashed line represents a classifier that is based on random 
choice. 



wave signal. Clustering implies that among all candidates 
within the time window, only the one with highest statisti- 
cal significance will be retained. In order to avoid double 
counting of possibly correlated glitches and to replicate condi- 
tions similar to a real-life gravitational-wave search, we apply 
a clustering procedure to the glitch samples with a one second 
time window. In this time window, we keep the sample with 
the highest significance, p, of the transient in gravitational- 
wave channel. 

The ROC curves are computed after clustering. Figure |4] 
shows the ROC curves for ANN, SVM, RF and OVL for both 
S4 and S6 data. 

All our classifiers show comparable efficiencies in the most 
relevant range of the probability of false alarm for practical 
applications (10~^ - 10"^). Of the three MLA classifiers, RF 
achieves the best efficiency in this range, with ANN and SVM 
getting very close near Pq = 10^^. Relative to other classi- 
fiers, SVM performs worse in the case of S4 data, and ANN's 
efficiency drops fast at P < 10~^. The most striking fea- 
ture on these plots is how closely the RF and the OVL curves 



ure[4b]respectively). In absolute terms, the classifiers achieve 
significantly higher efficiency for S6 than for S4 data, 56% 
verses 30% at Pq — 10^^. We also note that the clustering 
procedure has more affect on the ROC curves in S4 than in 
S6 data. In the former case, the efficiency drops by 5 - 10% 
(compare to the curves in Figures 3a to 3c i, whereas in the 



follow each other in both S4 and S6 data (Figure 4a and Fig 



latter it stays practically unchanged (compare to Figures 3d 
to[3^. The reason for this is not clear In the context of detec- 
tor evolution, the S6 data are much more relevant for advanced 
detectors. At the same time, we should caution that we use 
just one week of data from the S6 science run and larger scale 
testing is required for evaluating the effect of the detector's 
non-stationaiity. 

The ROC curves characterize the bulk performance of the 
classifiers, but they do not provide information about what 
kind of glitches are identified. To gain further insight into 
the distribution of glitches before and after classification, 
we plot cumulative histograms of the significance, p, in the 
gravitational-wave channel for glitches that remain after re- 
moving those detected by each of the classifiers at Pq ^ 10~^. 
We also plot a histogram of all glitches before any glitch re- 
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FIG. 4. Comparing algorithmic performance. We directly compare 
the performance of RF (green), ANN (blue), SVM (red), and OVL 
(light blue) using the full data sets. Glitches are clustered in time. We 
see that all the classifiers perform similarly, particularly in S6. There 
is a general trend of higher performance in S6 than in S4, which we 
attribute to differences in the types of glitches present in the two data 
sets. We should also note that all the MLA classifiers achieve perfor- 
mance similar to our benchmark, OVL, but RF appears to perform 
marginally better for a large range of the False Alarm Probability. 
The dashed line corresponds to a classifier that is based on random 
choice. 



moval. These histograms are shown in Figure |5] They show 
the effect of each classifier on the distribution of glitches in 
the gravitational-wave channel. In both the S4 and S6 data 
sets, the tail of the glitch distribution, containing samples 
with highest significance, is reduced. At the same time, as 
is clear from the plots, many glitches in the mid range of sig- 
nificances are also removed, contributing to overall lowering 



of the background for transient gravitational-wave searches. 
The fact that our classifiers remove low significance glitches 
while some of the very high significance glitches are left be- 
hind indicates that there is no strong correlation between am- 
plitude of glitches in gravitational-wave channel and their de- 
tectability. This in turn implies that we either do not provide 
all necessary information for identification of these high sig- 
nificance glitches in the input feature vector or the classifiers 
somehow do not take advantage of this information. Given the 
close agreement between various classifiers that we observe in 
the ROC curves (Figure|4| and the histograms of glitch distri- 
butions (Figure |5]l, the former alternative seems to be more 
plausible. Alternatively, our choices of the thresholds and the 
coincidence windows that went into the construction of the 
feature vectors might not be optimal. Also, heretofore unin- 
cluded features characterizing the state of the detector, which 
may amplify transient disturbances in the auxiliary channels 
and induce glitches in the gravitational-wave channel, might 
be crucial for identifying glitches missed in the current analy- 
sis. We leave the investigation of these possibilities to future 
work. 

Although the ROC curves (Figure [4]) and the histograms 
(Figure |5]) provide strong evidence that all classifiers detect 
the same glitches, they do not give a clear quantitive picture of 
the overlap between these methods. To see this more clearly, 
we define subsets of glitches based on which combination of 
classifiers detected them with a probability of false alarm less 
than 10"^. We determine overlaps between the MLA classi- 
fiers by constructing a bit- word diagram (Figure |6]l. It clearly 
demonstrates a high degree of redundancy between the classi- 
fiers. The fraction of glitches detected by all three MLA clas- 
sifiers is 91.3% for S6 data and 78.4% for S4 data. For com- 
parison, we also construct a bit-word diagram for the clean 
samples, shown on the same figure, which are falsely identi- 
fied as glitches with probability of false alarm less than 10"^. 
The classifiers' predictions for clean samples are distributed 
almost uniformly. This suggests that our classifiers select 
clean samples nearly independently, or at least with a much 
lower level of correlation than for glitches. 

Next, we compare the MLA classifiers to OVL. In order to 
reduce the number of possible pairings, we combine the MLA 
classifiers following the maximum-Iikelihood-ratio algorithm 
described in more detail in the next Section [Vlll In short, this 
algorithm picks the most statistically significant prediction out 
of the three MLA classifiers for each event. We denote the 
combined classifier as MLAmax- As in the previous case, we 
construct the bit-word diagram for both glitch and clean sam- 
ples detected with the probability of false alarm less than 10"^ 
(Figure |7]i. The redundancy is even stronger The fraction of 
glitches detected by MLAmaxand OVL is 94.8% for S6 data 
and 85.2% for S4 data. The full bit-word histograms show the 
same behavior and we omit them here. 



VII. METHODS FOR COMBINING CLASSIFIERS 

On a fundamental level, the MLA classifiers search for a 
one parameter family of decision surfaces in the feature space. 
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FIG. 5. Comparing distribution of glitches before and after apply- 
ing classifiers at 1 % probability of false alarm. This cumulative 
histogram shows the number of glitches that remain with at least as 
high a significance in the gravitational-wave channel. We see that 
all our classifiers remove similar fractions of glitches at 1% proba- 
bility of false alarm. This corresponds to their similar performances 
in Figure|4] with efficiencies near 30% and 55% for S4 and S6 data, 
respectively. We also see that the classifiers tend to truncate the high- 
significance tails of the non-Gaussian transient distributions, partic- 
ularly in S6. 



X G Vd, by optimizing a detection criterion. The parameter 
labeling the decision surfaces can be mapped into a contin- 
ues rank, r'MLA(a^) G [0, !]• The rank reflects the odds for a 
sample, x, to correspond to a glitch in the gravitational-wave 
channel. As we discuss in Section III and Appendix [A] theo- 
retically, if the classifiers use consistent optimization criteria, 
they should arrive at the same optimal decision surfaces and 
make completely redundant predictions. In other words, their 
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(b) S6 bit-word histogram for MLAs 

FIG. 6. Redundancy between MLA classifiers. These histograms 
show the fractions of detected glitches identified in common by a 
given set of classifiers at 1% probability of false alarm (blue). The 
abscissa is labeled with bit-words, which are indicators of which 
classifier(s) found that subset of glitches (eg: Oil corresponds to 
glitches that were not found by ANN, but were found by RF and 
SVM). The quoted percentages represent the fractions of detected 
glitches so that 100% represents those glitches which were success- 
fully identified by at least one of the classifiers at 1% false-alarm 
probability. The three classifiers show large overlap for glitch identi- 
fication (bit-word — 111), meaning the classifiers are largely able to 
identify the same glitch events. Also shown is the fraction of clean 
samples (green) misidentified as glitches, which shows a compara- 
tively flat distribution across classifier combinations. 



ranks would be functionally dependent. In practice, however, 
different classifiers often lead to different results, primarily 
due to the limitations in the number of samples in the train- 
ing sets and/or computing resources. For instance, different 
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clean samples. Again, by combining the output of different 
classifiers, we may be able to extract information about these 
correlations and improve the total efficiency of our analysis. 

This last case appears to be applicable to our data set. From 
Section 



VI 



we see that at a probability of false alarm of 
1%, all classifiers remove nearly identical sets of glitches (to 
within 10% for the S6 data). However, the classifiers agree 
to a significantly lesser extent on the clean samples they re- 
move (Figure |6|l. This suggests that the correlations between 
the classifiers' predictions are different for glitches and clean 
samples, and combining the classifiers' output could possibly 
lead to an improved analysis. 

The general problem of combining the results from multi- 
ple, partially redundant analysis methods has been addressed 
in the context of gravitational- wave searches in ll20l . Treating 
the output of the classifiers, namely their ranks, as new data 
samples, one arrives at the optimal combined ranking given 
by the joint likelihood ratio: 
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FIG. 7. Redundancy between MLAmax and OVL. This figure is 
similar to Figure |6] except these histograms only compare the re- 
sults of combining the MLA classifiers into a single unified classifier 
(MLAinax) and OVL. Even though OVL only considers pairwise 
correlations between auxiliary channels and the gravitational-wave 
channel, we see that it identifies the same glitches as MLAmax- This 
suggests that the glitches identified by the MLA classifiers are effec- 
tively characterized by pairwise correlations between a single auxil- 
iary channel and the gravitational-wave channel, and that consider- 
ing multi-channel correlations does not add much. We also see that 
these classifiers are highly correlated on their selection of glitches 
(blue), but less correlated across the set of misidentified clean sam- 
ples (green). 



classifiers may be more or less sensitive to different types of 
glitches. In this case, one should be able to detect a larger set 
of glitches by combining their output. Furthermore, the clas- 
sifiers may be strongly correlated in the ranks they assign to 
glitch samples, but only weakly correlated when classifying 
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where r = (rANN, ''svmi ^rf) is the vector of the MLA ranks 
assigned to a sample, x, and p{f \ 1) and p{r \ 0) are the prob- 
ability density functions for the rank vector in the case of 
glitch and clean samples, respectively. We should point out 
that we can modify this ranking by multiplying by the ratio of 
prior probabilities (p(l)/p(0)) to match the rankings for in- 
dividual classifiers without affecting the ordering assigned to 
samples. Typically, these conditional probability distributions 
are not known and computing the joint likelihood ratio from 
first principles is not possible. One has to develop a suitable 
approximation. We try several different approximations when 
combining algorithms. 

Our first approximation, and perhaps the simplest, esti- 
mates the likelihood ratio for each classifier separately and 
assigns the maximum to the sample. This method should be 
valid in the two limits: extremely strong correlations and ex- 
tremely weak correlations between the classifiers. It was first 
suggested and applied in the context of combining results of 
multiple gravitational-wave searches in ll20ll . We estimate the 
individual likelihood ratios in two ways: 1) as the ratio of 
cumulative density functions (cdf) and 2) as the ratio of kernel 
density estimates for the probability density function (pdf). 
Though a proper estimate should involve the pdfs, the advan- 
tage of using cdfs is that we already calculate them when eval- 
uating the efficiency and probability of false alarm for each 
classifier. They should approximate the ratio of pdfs reason- 
ably well in the tail of the distributions, when the probability 
of false alarm is low. This assumes that pdfs are either slowly 
varying or simple (e.g. power law or exponential) decaying 
functions of the rank. However, at large values of the prob- 
ability of false alarm or in the case when the probability dis- 
tributions exhibit complicated functional dependence on the 
rank, our approximation may break down and we will have to 
resort to the more fundamental ratio of the pdfs. Explicitly, 
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we estimate the joint likelihood ratio using 



|0)dr' 



(14) 



We refer to this method as MLA^axWhen introducing it in the 
context of Figure |7] 

We also construct smooth one-dimensional pdfs for clean 
and glitch samples from their ranks using Gaussian kernel 
density estimation |41J. These estimates were built using a 
constant bandwidth equal to 0.05 in the rank space, which 
ranges from to 1. Based on this, we define the approximate 
combined rankings: 



Pjrj I 1) 
p{r, I 0) 



(15) 



It is by no means true that we can always approximate the 
multi-dimensional likelihood ratio ( pjj ) with the maximum 
over a set of one-dimensional likelihood ratios. If we can 
better model the multi-dimensional probability distributions, 
we should be able to extract more information. To this end, 
we also implement a slightly more complicated combining 
algorithm. We observe that the algorithms are highly corre- 
lated on which glitches they remove, and less correlated on 
the clean samples (see Figure |6]l. We therefore approximate 
p{f\l) w maxr^.{p(rj | 1)} and p(f | 0) w UjPi^j I 0)' 
which assumes that the algorithms are completely uncorre- 
cted for the clean samples. Ajoint is then approximated by 



axr, {pjrj I 1)} 

U^Pin |0) 



(16) 



Again, we compute the individual pdfs using Gaussian kernel 
density estimation. 

More subtle, but still useful, correlations between the ranks 
assigned by different classifiers can not be accounted for by 
these simple analytical approximations. Estimating the multi- 
dimensional probability distributions is a difficult task and 
under-sampling quickly becomes the dominant source of error 
when expanding to higher than two dimensions. Rather than 
developing a complicated analytic model, we can use one of 
the MLA classifiers to compute the combined rank. We use 
RF to attempt to combine the ranks from each classifier and 
construct an estimate of the full (three-dimensional) joint like- 
lihood ratio. 

We compare the methods for combining the classifiers by 
computing ROC curves, which are shown in Figure |8] We 
reproduce only the S6 curves because the S4 data shows the 
same trends. 

All combined methods result in very similar ROC curves 
and, when compared to the OVL curve, they do not seem to 
improve the overall performance by more than a few percent. 
These combined results lead us to conclude that the individual 
classifiers have already reached nearly optimal performance 
for the given input data, and that their combination, while 
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FIG. 8. Comparison of different combining algorithms using S6 data. 
This figure compares the performance of our various schemes for 
combining the output of the three MLA classifiers. We note that all 
four algorithms, Li (Equation jl4[)), L2 (Equation |T5j), L3 (Equa- 
tion ([T|J), and using RF to classify events based on the MLA output 
vector r, agree to a remarkable degree. The fact that our simple an- 
alytic algorithms perform just as well as the RF suggests that there 
are not many subtle correlations between the classifiers' output. The 
MLA combining algorithms do not perform much better than OVL. 
Comparing these curves with Figure|4]shows that the combined per- 
formance does not exceed the individual classifier's performances. 
This suggest that the individual MLA classifiers each extract almost 
all of the useful information from our feature vectors, and that they 
identify the same types of glitches. These conclusions are further 
supported by Figure |6] 



increasing their robustness, can not improve the overall effi- 
ciency. Basically, all the useful information has been extracted 
aheady. 

Although it is not immediately apparent, these combining 
schemes do add robustness to our identification of glitches. 
The combining algorithms are able to ignore underperform- 
ing classifiers and reject noisy input fairly well, and we see 
that they tend to select the best performance from the indi- 
vidual classifiers. By comparing Figure [8] with Figure |4] we 
see that the combining algorithms follow the best ROC curve 
from figure Figure |4] even when individual classifiers are not 
performing equitably. This is most evident at extremely low 
probabilities of false alarm. This robustness is important be- 
cause it can protect a combined glitch identification algorithm 
from bugs in a single classifier. In this way, the combining 
algorithm essentially acts as an automatic cross-reference be- 
tween individual MLA classifiers. 



VIII. CONCLUSION 

In this study, we apply various machine learning algo- 
rithms to the problem of identifying transient noise artifacts 
(glitches) in gravitational-wave data from LIGO detectors. 
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Our main goal is to establish the feasibility of using MLAs for 
robust detection of instrumental and environmental glitches 
based on information from auxiliary detector channels. We 
consider several MLA classifiers: the Artificial Neural Net- 
works, the Support Vector Machine, and the Random Forest. 
We formulate the general detection problem in the context of 
glitch identification using auxiliary channel information and 
show that, theoretically, all classifiers lead to the same opti- 
mal solution. In a real-life application, our classifiers have to 
monitor a large number of channels. Even after data reduc- 
tion, the dimensionality of our feature space could be as high 
as 1250, making classification a truly challenging task. We 
test classifiers using data sets from S4 and S6 LIGO scientific 
runs. We use standard ROC curves as the main figure of merit 
to evaluate and compare the classifiers' performances. 

Our tests show that the classifiers can handle extraneous 
features efficiently without affecting their performance. Like- 
wise, we find that the classifiers are generally robust against 
changes in the size of the training set. The most important 
result of our investigation is the confirmation that the MLA 
classifiers can be used to monitor a large number of auxiliary 
channels, many of which might be irrelevant or redundant, 
without a loss of efficiency. These classifiers can be used to 
develop a real-time monitoring and detector characterization 
tool. 

After establishing the robustness of the classifiers against 
changes in the input data and the presence of nuisance pa- 
rameters, we evaluate the algorithms' performance in terms 
of the ROC curve and carry out detailed comparison between 
the classifiers. This includes their impact on the overall dis- 
tribution of glitches in the gravitational-wave channel and the 
redundancy of their predictions. We find that at a false alarm 
probability of 1%, all classifiers demonstrate comparable per- 
formance and achieve 30% and 56% efficiency at identify- 
ing single-detector glitches above our nominal threshold when 
tested on the S4 and S6 data respectively. While not superb, 
this is a step toward the ultimate goal of being able to remove 
all artifacts from the data. 

One advantage of the MLA classifiers is that they provide 
a continues rank, tmla G [0,1], rather than a binary flag. 
This statistic can be incorporated directly into the searches for 
gravitational-waves as a parameter characterizing a candidate 
event along with the rest of the data from the gravitational- 
wave channel, as opposed to a standard approach of vetoing 
entire segments of flagged data based on a hard threshold on 
data quality. 

Another advantage of the MLA classifiers is that they can 
incorporate various, potentially diverse types of information 
and establish correlations between multiple parameters. In ad- 
dition to the transient noise data used in this study, we plan to 
include more slowly-varying baseline information about the 
detector subsystems in the future. For example, the quality of 
alignment in the interferometer may be important for predict- 
ing the amount of noise that couples into the gravitational- 
wave channel from elsewhere in the instrument. Machine 
learning should be able to automatically identify such non- 
linear correlations, even if they are not known previously. 

As a final test, we explore several ways of combining the 



output of several classifiers in order to increase the robust- 
ness of their predictions and possibly improve combined ef- 
ficiency. Following general principles for combing multiple 
analysis methods, we suggest several approximations for the 
optimal combined ranking given by the joint likelihood ratio. 
We test our approximations and find that they perform simi- 
larly to and do not improve upon the efficiencies of individual 
classifiers. 

Based on these results, we conclude that the three MLA 
classifiers used in this study: ANN, SVM, and RF, are all 
able to achieve robust and competitive classification perfor- 
mance for our set of data. The RF classifier was the most 
robust against the form (range, shape, scaling, number) of in- 
put data, while ANN and SVM benefit from reshaping certain 
input parameters along physical arguments. Since all classi- 
fiers achieve similar limiting performance and identify most 
of the same events, we conclude that they are near-optimal in 
their use of the existing data. Future improvement in classi- 
fication efficiency is therefore likely to come from including 
additional sources of useful information, rather than refine- 
ments to the algorithms themselves. 
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Appendix A: Derivation of tlie Decision Surfaces for some MLA 
Optimization Schemes 

In Section [Hi] we stress that for the detection (or the two- 
class classification) problem, the most natural optimizing cri- 
terion is the Neyman-Pearson criterion, which requires max- 
imum probability of detection at fixed probability of false 
alarm. Optimizing this criterion, which in functional form is 
given by Q, leads to the one-parameter family of decision 
surfaces defined as surfaces of constant likelihood ratio (|5]l. 
Each decision surface is labeled by a corresponding value of 
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the likelihood-ratio, A(a;), providing a natural ranking for un- 
classified events. The higher the likelihood-ratio, the more 
likely it is that event belongs to Class 1 . The likelihood-ratio 
can be mapped to the particular value of the false alarm prob- 
ability, Po{A), which assigns it a statistical significance. In 
practice, it is often more convenient to define ranking in terms 
of some monotonic function of the likelihood-ratio, r(A) (eg: 
r(A) = In A). Classifying and ranking samples based on 
the likelihood-ratio is guaranteed to maximize the ROC curve 
Pi{Po). 

In their standard configurations, most MLA classifiers ap- 
ply other kinds of optimization criteria (e.g. the fraction 
of correctly classified events or the Gini index). Many of 
these criteria treat the two classes of the events symmetri- 
cally, which often is more appropriate than the asymmetric 
Neyman-Pearson criterion. In this section, we would like to 
explore some of the most popular criteria used by the MLA 
classifiers in their relation to the Neyman-Pearson criterion. 
Specifically, we are interested in establishing consistency be- 
tween the various criteria used in this study, in that they lead to 
the same optimal decision surfaces and compatible rankings. 



1. The Fraction of Correctly Classified Events 

First, we consider probably the most popular criterion - the 
fraction of correctly classified events. This criterion is used by 
both ANN and SVM in our analysis. Following the approach 
of section|lIl) we define it as a functional of the decision func- 
tion, f{x), on the feature space, V^: 



C= I e{f{x)-F*)p{l \ x)p{x)dx 



Here p{x \ 1) and p{x \ 0), defined in Section III are the prob 



Q{F* - f{x))p{0 \x)p{x)dx, 



(Al) 



where p{l \ x) and p{0 \ x) are the probabilities for a sample 
with feature vector x to be classified as Class 1 and Class 0, 
respectively, andp{x) is the probability distribution of obtain- 
ing a feature vector, x, regardless of its class. To elucidate 
this expression, we note that p{c \ x)p{x) dx corresponds to 
the fraction of total events that fall in the hyper-volume dx 
and are of class c. Without the Q functions, these integrals 
would evaluate to C — p{l) + p{0) = 1, and we see that the 
Q functions select only those events that are correctly classi- 
fied. With this interpretation, f{x) is the decision function. 
For a given threshold F*, it defines two regions of samples 
as Class 1 and Class through Q {f{x) — F*) and its com- 
plement 8 {F* — f{x)), respectively. Thus, the first term in 
(All accounts for correctly classified events of Class 1 and the 



second term does the same for Class events. 

Using Bayes theorem, one can express p{l \ x)p{x) 
p{Q I x)p{x) in the first and second term of ( Al i as: 



and 



p{l I x)p{x) = p{x I l)p(l) , (A2a) 
p(0 I x)p{x) = p{x I O)p(O) . (A2b) 



ability density functions for feature vectors in the presence 
and absence of a glitch in gravitational-wave data, respec- 
tively. p{l) and p(0) are the prior probabilities for having a 
glitch or a clean datum, related to each other viap(l)+p(0) — 
1 as always. The fraction of correctly classified events is then 
given by 

C = f e{f{x)-F*)p{x\l)p{l)dx 



+ e (F* - fix))pix I O)p(O) dx . (A3) 

The requirement that the variation of C with respect to f{x) 
must vanish. 



(A4) 



6C^ / 6if{x)-F*)Sf{x) 

X [p{x I l)p(l) - p{x I O)p(O)] da; = . 



leads to the following condition for the points, x*, satisfying 

f{x*)-F* = 0: 



p{x* \l)p{l)-p{x* |0M0)=0. 



(A5) 



This equation defines the decision surface consisting of points 
for which 



A(x*) 



P(l) _ P{x* I l)p(l) 
p(0) pix* I O)p(O) 



= 1 



(A6) 



Thus, optimizing the fraction of correctly classified events 
leads to a specific decision surface, for which the likelihood 
ratio, A(a;*) = p{0)/p{l). By construction, the optimization 
criterion ( |A1[ ) treats events of the both classes symmetrically, 
maximizing the number of correctly classified events. As a 
result, it selects a specific decision surface for which evidence 
for the sample to belong to either one class or the other are 
equal. 



2. The Gini Index 

Next, we consider the Gini index criterion, which is used in 
RF. For two-class problems, the Gini index of a region in the 
feature space, V, is defined as 



G(y) = i-p2 



2pq 



(A7) 



where p and q are fraction of Class 1 and Class samples in 
the region, with p+q — 1. The Gini index for multiple regions 
is given by the average 



G = ^G(y.)pm) 



(A8) 



where p{Vi) is the probability for a sample to be in the region 
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For the two-class problem, there are two distinct regions - 
the region where samples are classified as Class l,Vi, and the 
region where samples are classified as Class 0, Vq. We make 
the following definitions: 

P = / e{f)p{l\x)pix)dx^ f Q{f)p{x\l)p{l)dx, 

(A9a) 



0= / e{f)p{o\x)pix)dx^ / e{f)p{x\o)p{o)dx, 

(A9b) 

where P (Q) is the probability that a glitch (clean sample) 
will fall into Vi and e(/) is shorthand for 6 {f{x) ~ F*). 
We recognize that the expected fractions of events in Vi can 
be described as p = P/p{Vi) and q = Q/p{Vi), where 
p{Vi) = Jy^'d{f)p{x) dx is the probability for any event (ei- 
ther glitch or clean sample) to fall in Vi. Furthermore, we can 
immediately write the corresponding relations for Vq in terms 
of P, Q, and p{Vi) = 1 - p{Vo). We then obtain: 



G 
2" 



PQ , {p(l) - P) {p{0) - Q) 



(AlO) 



where p{l) and p{Q) are the prior probabilities for an event to 
be Class 1 and Class 0, respectively. We also note that p(Vi), 
P, and Q are functionally related through 



p{Vi 



eif)pix)dx 



= / e(/) b(l I x)p{x)+p{0 I x)pix)] dx 
^P + Q. (Alia) 



We now optimize the Gini index, (Equation ( |A10| i), to de- 
termine the optimal decision surface. For simplicity, let us 
first optimize G while holding p{Vi) constant. We are in- 
terested in the optimal decision surface's shape, and this op- 
timization will determine it while keeping the ratios of the 
number of samples in Vi and Vq constant. We obtain 

1 SG _ p{x*)Q + q{x*)P p{x*)Q + q{x*)P 
2Sf{x)~ p(y,) + l~p{Vi) 
pil)qix*)+p{0)pix*) 



l-p(Fi) 



(A12) 



where 



Pix*) 



q{x*) 



5P 



S.,.' / Sif{x)-F*)pix\l)pil)dx, 

(A 13a) 

j^,=5,..' f S{f{x)-F*)p{x\0)piO)dx, 

(A13b) 



where S^.x' = Sf{x)/6f{x*) ~ {1 if x* = a;,0 otherwise}. 
We see that p{x*) and q{x*) are probability density functions 
defined on the decision surface {x* G {f{x*) = F*}). We 



can write these variations in terms of conditional probabilities 
on the decision surface. 



p{x*) 
q{x*) 



p{x* 
p{x* 



O)p(O) . 



(A14a) 
(A14b) 



Requiring the variation of the Gini index, ( A12 1, to vanish 
leads to the following relation 



p{x*) _ p{l)p{Vi)-P 



q{x* 



p{o)p{y^)^Q 



(A15) 



Identifying the left hand side of this equation with the 
likelihood ratio for a point on the decision surface, 

K[x*) {p{l)/p[Q)) = p{x* I l)p{l)/p{x* I O)p(O), and using 
( |A11[ ) we find that 



K{x*) 



p(0) 



1 



(A16) 



which will hold for all points on the decision surface. Re- 
markably, this condition is independent of P, Q and p{Vi). 
As in the case of the fraction of correctly classified events 



( A6 1, it implies that the optimal decision surface is the surface 
on which the likelihood ratio is equal to a ratio of the priors. 

If we consider the more general maximization problem, in 
which we allow p{Vi) to vary, we must maximize 



PQ , - P) (p(0) - 0) 



2 piV.)' 1-piV.) 

(A17) 

where we use a lagrange multiplier (A) to enforce the condi- 
tion p{Vi) — P — Q — and consider variations of p( Vi )to be 
independent of variations in f{x). 

The variation with respect to p{Vi) defines the lagrange 
multiplier. 



A 



PQ (1 - 2p{V,)) - pjV.f [p(l)p(0) ~ p{l)Q - p{0)P] 

p{v^ni-p{v^)f 

(A18) 

Variation with respect of f{x) is given by (A12i minus 
X{p{x*) + q{x*)). Setting it to zero for all independent varia- 
tions of f{x*) leads to a more general condition on the likeli- 
hood ratio: 



A(a;* 



\p{Vi){l-p(Vi))+pil)piV,)-P 
Xp{V,)il-piVi))+p{0)p{V,)~Q 



(A19) 



which holds separately for all points x* on the decision sur- 
face. 

First of all, note that this condition still requires a constant 
likelihood-ratio on the decision surface. For each point on the 
surface, A(x*) is determined by P and Q, which are constants 
for a given surface. We recover (A15 i and ( A16 1 when A = 0. 



In all other cases, the likelihood ratio on the decision surface 
is given by a quite complicated expression, obtained by plug- 



ging ( A18 I and (All i into (A19 1. In practice, the likelihood 



ratio on the decision surface is set by the desired value for the 
probability of false alarm, Pq, but the ratio will be constant 
over the entire surface. Optimization of the Gini index, then, 
will be equivalent to optimizing the ROC curve in the region 
of interest, e.g. in our study near Pq = 10~^. 
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3. Asymmetric Criteria 

Both criteria considered so far are symmetric in their treat- 
ment of events in Class 1 and Class 0. While the asymme- 
try can be imposed by tuning the ratio of prior probabilities 
(p(l)/p(0)), in some cases it might be desirable to use an ex- 
plicitly asymmetric criteria. The RF implementation in the 
StatPatternRecognition package contains two different asym- 
metric criteria, which we explore in our study: Signal Purity 
(P) and Signal Significance (S) |40|. By construction, they 
identify one class of events as signal and the other as noise 
and place more emphasis on correctly classifying signal rather 
than noise. 

The Signal Purity and the Signal Significance are defined 



P = 



S = 



^1 

UJl +ujo ' 

UJl 



(A20) 
(A21) 



where uji and ujq are the fraction of Class 1 (signal) and Class 
(noise) events in the signal region. The aim is to identify 
the signal region with the highest Signal Purity or Signal Sig- 
nificance. In the process of the decision tree construction, the 
classifier identifies the terminal nodes with the highest values 
of P or 5 and orders nodes using these criteria as a rank. For 



a given terminal node of a tree, one can express wi and ujq in 
terms of conditional probabilities: 



UJl ^p{x I 1) , 
ujQ = p{x I 0) . 



(A22) 
(A23) 



In terms of these, one can rewrite the signal purity and the 
signal significance as 



P 



pix I 1) 



A(x) 



p{x 



l)+p{x |G) 
p{x 1 1) 



Aix) + 1 



A{x) 



^p{x\ir+pix I 0)2 ^A^{x) + 1 



(A24) 
(A25) 



Both quantities are monotonic functions of the likelihood ratio 
A(.t) = p{x I l)/p{x I 0). Ordering nodes in the tree would 
be equivalent to ranking events by the likelihood ratio, which 
in turn is equivalent to using decision surfaces of the constant 
likelihood ratio for classification. 

In practice, different classifiers have various limitations 
which results in suboptimal performance. Depending on the 
application, one algorithm or criteria may be more optimal 
than another, but we establish here that on a theoretical level 
they all recover the same optimal solution. In the two-class 
classification problem, the decision surfaces are surfaces of 
constant likelihood ratio, which also defines the optimal rank- 
ing for samples. 
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